Dynamic Document Clustering Using Singular Value Decomposition

Authors

  • Rashmi Nadubeediramesh
  • Aryya Gangopadhyay
Abstract

Incremental document clustering is important in many applications, and particularly in healthcare, where text data is abundant, ranging from published journal research to day-to-day clinical records such as discharge summaries and nursing notes. In such dynamic environments, new documents are constantly added to the set of documents used to form the initial clusters. It is therefore important to be able to update the clusters incrementally, at low computational cost, as new documents arrive. In this paper the authors describe a novel, low-cost approach to incremental document clustering. Their method is based on performing singular value decomposition (SVD) incrementally. They fold new documents into the existing term-document space and assign them to pre-defined clusters based on intra-cluster similarity. This avoids the cost of re-computing the SVD on the entire document set every time an update occurs. The authors also provide a way to retrieve documents based on different window sizes with high scalability and good clustering accuracy. They tested the proposed method experimentally on 960 medical abstracts retrieved from the PubMed medical library. The incremental method is compared with the default approach, in which the SVD is completely re-computed whenever new documents are added to the initial set. The results show minor decreases in the quality of the cluster formation but much larger gains in computational throughput.
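The folding-in idea described in the abstract can be illustrated with a minimal sketch. It assumes the standard latent semantic indexing projection, d_hat = S_k^{-1} U_k^T d, and assigns a folded-in document to the cluster whose centroid is most similar by cosine similarity. The abstract does not spell out the exact assignment rule, so the centroid-based rule, the toy matrix, and the function names (`truncated_svd`, `fold_in`, `assign_cluster`) are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k SVD of a term-document matrix A (rows = terms, columns = documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # U_k: term space, s_k: singular values, V_k: one k-dim row per document
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(d, U_k, s_k):
    """Project a new term vector d into the existing k-dim space without
    re-computing the SVD: d_hat = S_k^{-1} U_k^T d."""
    return (U_k.T @ d) / s_k

def assign_cluster(d_hat, centroids):
    """Assumed rule: pick the cluster whose centroid has the highest cosine similarity."""
    norms = np.linalg.norm(centroids, axis=1) * np.linalg.norm(d_hat) + 1e-12
    return int(np.argmax((centroids @ d_hat) / norms))

# Toy corpus (hypothetical data): 6 terms x 4 documents, two topics.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 1.0, 2.0],
])
labels = np.array([0, 0, 1, 1])  # initial clusters, assumed already known

U_k, s_k, V_k = truncated_svd(A, k=2)
centroids = np.vstack([V_k[labels == c].mean(axis=0) for c in (0, 1)])

# Two new documents arrive; each is folded in and assigned without a new SVD.
new_doc_a = np.array([1.0, 1.0, 2.0, 0.0, 0.0, 0.0])
new_doc_b = np.array([0.0, 0.0, 0.0, 2.0, 1.0, 0.0])
cluster_a = assign_cluster(fold_in(new_doc_a, U_k, s_k), centroids)
cluster_b = assign_cluster(fold_in(new_doc_b, U_k, s_k), centroids)
print(cluster_a, cluster_b)
```

Folding in costs one matrix-vector product per new document, which is where the throughput gain comes from. Because the basis `U_k` is never updated, the representation drifts as more documents are added, which is consistent with the small loss in cluster quality the abstract reports.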


Similar resources

Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members

Graphs have many applications in real-world problems. When we deal with a huge volume of data, analyzing the data is difficult or sometimes impossible. In big data problems, clustering is a useful tool for data analysis. Singular value decomposition (SVD) is one of the best algorithms for clustering graphs, but we do not have any choice to select the number of clusters and the number of members ...


Change Point Estimation of the Stationary State in Auto Regressive Moving Average Models, Using Maximum Likelihood Estimation and Singular Value Decomposition-based Filtering

In this paper, for the first time, the subject of change point estimation has been utilized in the stationary state of auto regressive moving average (ARMA) (1, 1). In the monitoring phase, in case the features of the question pursue a time series, i.e., ARMA(1,1), on the basis of the maximum likelihood technique, an approach will be developed for the estimation of the stationary state’s change...


Document Clustering: Before and After the Singular Value Decomposition

Document Clustering is an issue of measuring similarity between documents and grouping similar documents together. Information Retrieval (IR) is an issue of comparing query with a collection of documents to locate a set of documents relevant to a particular query. In the vector space IR model, a query is treated as a document which consists of a few terms. Therefore, in both clustering and retr...


A Multi-Document Multi-Lingual Automatic Summarization System

In this paper, a new multi-document multi-lingual text summarization technique, based on singular value decomposition and hierarchical clustering, is proposed. The proposed approach relies on only two resources for any language: a word segmentation system and a dictionary of words along with their document frequencies. The summarizer initially takes a collection of related documents, a...


Hierarchical Document Clustering Using Correlation Preserving Indexing

This paper presents a spectral clustering method called correlation preserving indexing (CPI). The method operates in the correlation similarity measure space. Correlation preserving indexing explicitly considers the manifold structure embedded in the similarities between the documents. The aim of the CPI method is to find an optimal semantic subspace by maximizing the correlation between t...



Journal:
  • IJCMAM

Volume 3  Issue 

Pages  -

Publication date: 2012